- Title
- Graph-based Incident Aggregation for Large-Scale Online Service Systems
- Creator
- Chen, Zhuangbin; Liu, Jinyang; Su, Yuxin; Zhang, Hongyu; Wen, Xuemin; Ling, Xiao; Yang, Yongqiang; Lyu, Michael R.
- Relation
- 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). Proceedings of the 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) (Melbourne, Australia 15-19 November, 2021) p. 430-442
- Relation
- ARC.DP200102940 http://purl.org/au-research/grants/arc/DP200102940
- Publisher Link
- http://dx.doi.org/10.1109/ASE51524.2021.9678746
- Publisher
- Institute of Electrical and Electronics Engineers (IEEE)
- Resource Type
- conference paper
- Date
- 2021
- Description
- As online service systems continue to grow in terms of complexity and volume, how service incidents are managed will significantly impact company revenue and user trust. Due to the cascading effect, cloud failures often come with an overwhelming number of incidents from dependent services and devices. To pursue efficient incident management, related incidents should be quickly aggregated to narrow down the problem scope. To this end, in this paper, we propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures. A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations among incidents. Thus, it can be easily employed for online incident aggregation. In particular, to learn the correlations more accurately, we try to recover the complete scope of failures' cascading impact by leveraging fine-grained system monitoring data, i.e., Key Performance Indicators (KPIs). The proposed framework is evaluated with real-world incident data collected from a large-scale online service system of Huawei Cloud. The experimental results demonstrate that GRLIA is effective and outperforms existing methods. Furthermore, our framework has been successfully deployed in industrial practice.
- Subject
- cloud computing; online service systems; incident management; graph representation learning
- Identifier
- http://hdl.handle.net/1959.13/1435449
- Identifier
- uon:39723
- Identifier
- ISBN:9781665403375
- Language
- eng
- Reviewed